Dealing with Imbalanced Data using Bayesian Techniques
نویسندگان
چکیده
For the present work, we deal with the significant problem of high imbalance in data in binary or multi-class classification problems. We study two different linguistic applications. The former determines whether a syntactic construction (environment) co-occurs with a verb in a natural text corpus consists a subcategorization frame of the verb or not. The latter is called Name Entity Recognition (NER) and it concerns determining whether a noun belongs to a specific Name Entity class. Regarding the subcategorization domain, each environment is encoded as a vector of heterogeneous attributes, where a very high imbalance between positive and negative examples is observed (an imbalance ratio of approximately 1:80). In the NER application, the imbalance between a name entity class and the negative class is even greater (1:120). In order to confront the plethora of negative instances, we suggest a search tactic during training phase that employs Tomek links for reducing unnecessary negative examples from the training set. Regarding the classification mechanism, we argue that Bayesian networks are well suited and we propose a novel network structure which efficiently handles heterogeneous attributes without discretization and is more classification-oriented. Comparing the experimental results with those of other known machine learning algorithms, our methodology performs significantly better in detecting examples of the rare class.
منابع مشابه
Evaluating the effect of unbalanced data in biomedical document classification
Nowadays, document classification has become an interesting research field. Partly, this is due to the increasing availability of biomedical information in digital form which is necessary to catalogue and organize. In this context, machine learning techniques are usually applied to text classification by using a general inductive process that automatically builds a text classifier from a set of...
متن کاملAn Approach to Imbalanced Data Sets Based on Changing Rule Strength
This paper describes experiments with a challenging data set describing preterm births. The data set, collected at the Duke University Medical Center, was large and, at the same time, many attribute values were missing. However, the main problem was that only 20.7% of the total number of cases represented the important preterm birth class. Thus the data set was imbalanced. For comparison, we in...
متن کاملLearning Bayesian Network Structure Using Genetic Algorithm with Consideration of the Node Ordering via Principal Component Analysis
‎The most challenging task in dealing with Bayesian networks is learning their structure‎. ‎Two classical approaches are often used for learning Bayesian network structure;‎ ‎Constraint-Based method and Score-and-Search-Based one‎. ‎But neither the first nor the second one are completely satisfactory‎. ‎Therefore the heuristic search such as Genetic Alg...
متن کاملA Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis
Basically, medical diagnosis problems are the most effective component of treatment policies. Recently, significant advances have been formed in medical diagnosis fields using data mining techniques. Data mining or Knowledge Discovery is searching large databases to discover patterns and evaluate the probability of next occurrences. In this paper, Bayesian Classifier is used as a Non-linear dat...
متن کاملOn dynamic ensemble selection and data preprocessing for multi-class imbalance learning
Class-imbalance refers to classification problems in which many more instances are available for certain classes than for others. Such imbalanced datasets require special attention because traditional classifiers generally favor the majority class which has a large number of instances. Ensemble of classifiers have been reported to yield promising results. However, the majority of ensemble metho...
متن کامل